Intelligent RDD Management for High Performance In-Memory Computing in Spark
Abstract
Spark is a widely used in-memory computing framework in the era of big data. It can greatly accelerate computation by wrapping the accessed data as resilient distributed datasets (RDDs) and storing these datasets in fast main memory. However, main memory is limited, and Spark does not provide an intelligent mechanism for deciding which RDDs to keep in it. In this paper, we propose a fine-grained RDD checkpointing and kick-out selection strategy, by which Spark can intelligently select which RDDs to retain so as to make the best use of the available memory. The experiments are conducted on a server with four nodes. Experimental results demonstrate that the proposed techniques can effectively accelerate execution.
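The abstract does not spell out how the kick-out selection works, so the following is only an illustrative sketch of the general idea of choosing which RDDs to keep under a memory budget. It is not the paper's algorithm. The `RddInfo` fields, the `benefit_per_mb` score, and the greedy rule are all assumptions made for this example, and none of the names come from Spark's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RddInfo:
    """Hypothetical per-RDD statistics (assumed for illustration only)."""
    name: str
    size_mb: float           # memory footprint if cached; assumed > 0
    recompute_cost_s: float  # time to rebuild from lineage if kicked out
    future_uses: int         # number of later stages that read this RDD


def benefit_per_mb(rdd: RddInfo) -> float:
    # Recomputation time avoided by caching, per MB of memory occupied.
    return rdd.recompute_cost_s * rdd.future_uses / rdd.size_mb


def select_cached(rdds, memory_budget_mb):
    """Greedily keep the RDDs with the highest benefit per MB.

    Returns (kept, kicked_out) as lists of RDD names, in score order.
    """
    kept, kicked_out = [], []
    remaining = memory_budget_mb
    for rdd in sorted(rdds, key=benefit_per_mb, reverse=True):
        # Keep an RDD only if it will be reused and still fits in memory.
        if rdd.future_uses > 0 and rdd.size_mb <= remaining:
            kept.append(rdd.name)
            remaining -= rdd.size_mb
        else:
            kicked_out.append(rdd.name)
    return kept, kicked_out


# Made-up example workload with a 300 MB budget.
example = [
    RddInfo("a", size_mb=100, recompute_cost_s=10, future_uses=3),
    RddInfo("b", size_mb=200, recompute_cost_s=5, future_uses=1),
    RddInfo("c", size_mb=50, recompute_cost_s=2, future_uses=0),
    RddInfo("d", size_mb=150, recompute_cost_s=30, future_uses=2),
]
kept, kicked_out = select_cached(example, memory_budget_mb=300)
```

A greedy ratio rule is used here only because it is short and easy to follow. A real policy would also have to account for lineage dependencies between RDDs and for the cost of writing checkpoints to disk.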
Similar resources
Neutrino: Revisiting Memory Caching for Iterative Data Analytics
In-memory analytics frameworks such as Apache Spark are rapidly gaining popularity as they provide order-of-magnitude speedups over disk-based systems for iterative workloads. For example, Spark uses the Resilient Distributed Dataset (RDD) abstraction to cache data in memory and iteratively compute on it in a distributed cluster. In this paper, we make the case that existing abstracti...
GeoSpark: A Cluster Computing Framework for Processing Spatial Data
This paper introduces GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data. GeoSpark consists of three layers: the Apache Spark Layer, the Spatial RDD Layer, and the Spatial Query Processing Layer. The Apache Spark Layer provides basic Spark functionality, including loading data from and storing data to disk as well as regular RDD operations. The Spatial RDD Layer consists of three nove...
Efficient In-memory Data Management: An Analysis
This paper analyzes the performance of three systems for in-memory data management: Memcached, Redis, and the Resilient Distributed Datasets (RDD) implemented by Spark. By performing a thorough performance analysis of both analytics operations and fine-grained object operations such as set/get, we show that none of the systems handles both types of workloads efficiently. For Memcached and Redis the C...
Novel Apache Spark based Algorithm to Solve Dirichlet Problem for Poisson Equation in 3D Computational Domain
Corresponding Author: Shomanov Aday, Department of Computer Science, al-Farabi Kazakh National University, Almaty, Kazakhstan. Abstract: Parallel computations are an essential tool in solving large-scale, computationally demanding problems. Due to the large diversity and heterogeneity of the currently available parallel processing techniques and paradigms it is usually diff...
The Effects of Spark Training on Visual-Spatial Working Memory Operation in Children with Mental Retardation
Introduction: Mentally retarded children, who receive a wide range of health services, represent more than two percent of the population. Mental retardation is associated with significant constraints on mental performance and adaptive behavior as well as perceptual and practical skills. According to the studies, one of the important tools that can affect cognitive abilities, such as memory, is ...
Publication date: 2017